White Wine Quality Exploration

Abstract

This data set contains 4,898 white wines, with 11 variables quantifying the chemical properties of each wine.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

We explore which chemical properties influence the quality of white wines, and how the chemical properties relate to one another.

Introduction

## [1] FALSE

Univariate Plots Section

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Our dataset consists of 11 input variables and 1 output variable, with 4898 observations. The remaining column, X, is a row index.

See the distribution of the output variable (quality).

Most white wines in this dataset have a quality score of 5, 6 or 7.

The number of samples at each quality score is calculated below.

##   Group.1 number_of_sample
## 1       3               20
## 2       4              163
## 3       5             1457
## 4       6             2198
## 5       7              880
## 6       8              175
## 7       9                5
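
The table above has the column layout that base R's aggregate() produces. The following is a minimal sketch of how such a count could be computed, not necessarily the code used for this report. It assumes the data frame is named pf (as the later output suggests) and rebuilds the quality column from the reported counts instead of reading the original file.

```r
# Rebuild the quality column from the per-score counts reported above.
# This stands in for the real data frame (named `pf` in the output).
pf <- data.frame(
  quality = rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
)

# Count the rows at each quality score. aggregate() names the grouping
# column Group.1 by default; the count column is renamed by hand.
counts <- aggregate(pf$quality, by = list(pf$quality), FUN = length)
names(counts)[2] <- "number_of_sample"
print(counts)

# Share of wines rated 5, 6 or 7.
share_mid <- mean(pf$quality %in% 5:7)
print(round(share_mid, 3))
```

With these counts, roughly 93% of the wines are rated 5, 6 or 7.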

See the distributions of the input variables, starting with fixed acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The median of fixed acidity is 6.8 and the mean is 6.855. The two are close and the distribution is bell-shaped, so fixed acidity appears approximately normally distributed.
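
A close mean and median suggest symmetry but do not by themselves establish normality. One simple extra check is the sample skewness, sketched below. The skewness() helper and the two toy vectors are illustrative assumptions and are not taken from the wine data.

```r
# Sample skewness (third standardized moment): 0 for a symmetric sample,
# positive when the right tail is longer.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical toy values, not taken from the wine data.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

print(skewness(symmetric))
print(skewness(right_skewed))
```

On the real data the same helper could be applied to a column such as pf$fixed.acidity, and qqnorm() from base R gives a visual check against the normal distribution.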

Next, volatile acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The median of volatile.acidity is 0.260 and the mean is 0.278. The distribution is slightly right-skewed but still bell-shaped.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

This distribution is bell-shaped, but there is a small difference between the median (0.320) and the mean (0.334). Presumably this is because of an unusual peak at 0.5 g/dm^3, which pulls the mean upward.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Most wines have residual sugar between 0 and 20 g/dm^3, so the next graph restricts the x-axis to that range.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This distribution is fairly right-skewed: the median is 5.200 and the mean is 6.391. Many white wines have between 1 and 2 g/dm^3 of residual sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The median is 0.0430 and the mean is 0.0458. Although there are some outliers between 0.1 and 0.35, the distribution is fairly bell-shaped.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The median is 34.00 and the mean is 35.31. Although there are some outliers between 100 and 300, the distribution is fairly bell-shaped.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The median is 134.0 and the mean is 138.4. Although there are some outliers between 300 and 450, the distribution is fairly bell-shaped.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The median and the mean are almost the same, 0.9937 and 0.9940 respectively. A few outliers are observed in density. The distribution is fairly bell-shaped.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The median and the mean are almost the same, 3.180 and 3.188 respectively, and the distribution is fairly bell-shaped and approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The median is 0.470 and the mean is 0.490. The distribution is fairly bell-shaped but slightly right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

There is a peak between 9.25 and 9.5. The distribution is slightly right-skewed, and the data is spread broadly.

Univariate Analysis

What is the structure of this dataset?

This dataset contains 4898 white wines, with 1 output variable and 11 input variables.

The output is based on sensory data (median of at least 3 evaluations made by wine experts).

(worst) --> (best)
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Input variables (based on physicochemical tests):

  • 1 - fixed acidity (tartaric acid - g / dm^3):
    most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • 2 - volatile acidity (acetic acid - g / dm^3):
    the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  • 3 - citric acid (g / dm^3):
    found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • 4 - residual sugar (g / dm^3):
    the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  • 5 - chlorides (sodium chloride - g / dm^3):
    the amount of salt in the wine
  • 6 - free sulfur dioxide (mg / dm^3):
    the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • 7 - total sulfur dioxide (mg / dm^3):
    amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • 8 - density (g / cm^3):
    the density of wine is close to that of water, depending on the percent alcohol and sugar content
  • 9 - pH:
    describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  • 10 - sulphates (potassium sulphate - g / dm^3):
    a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • 11 - alcohol (% by volume):
    the percent alcohol content of the wine

What is/are the main feature(s) of interest in your dataset?

The main feature is quality. I would like to figure out which input features affect the quality the most and whether it is possible to predict the quality based on some input features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All the input features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) could possibly affect the wine quality.

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I noticed two unusual distributions:
- Residual sugar is concentrated between 1 and 2 g/dm^3.
- There is an unusual peak at 0.5 g/dm^3 in the citric acid data.

Also the lowest quality score is 3 and the highest is 9.

Bivariate Plots Section

## [1] "fixed.acidity"    "volatile.acidity" "citric.acid"     
## [4] "residual.sugar"   "chlorides"        "quality"

## [1] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [4] "pH"                   "sulphates"            "alcohol"             
## [7] "quality"

From these subsets of the data, the following can be said about the output (quality) and the input variables:
- The higher the citric acid, the higher the quality tends to be.
- The lower the residual sugar, the higher the quality tends to be.
- The lower the chlorides, the higher the quality tends to be.
- The lower the density, the higher the quality tends to be.
- The higher the pH, the higher the quality tends to be.
- For quality scores above 6, the higher the alcohol, the higher the quality tends to be.

We examine these relationships using the group means (the dark red points in the graphs) and boxplots.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2575  0.3450  0.3360  0.3850  0.4700 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1900  0.2900  0.3042  0.4000  0.8800 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3377  0.4100  1.0000 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3256  0.3600  0.7400 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.2800  0.3200  0.3265  0.3600  0.7400 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   0.340   0.360   0.386   0.450   0.490

Between quality 7 and 9, the amount of citric acid increases with quality. At the other quality levels, citric acid shows no increasing or decreasing trend.
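
The per-quality summaries above have the layout that base R's by() prints. A minimal sketch of that kind of grouped summary follows, assuming the data frame is named pf as in the output. The seven rows are hypothetical stand-ins for the real file, so the numbers will not match the report.

```r
# Hypothetical rows standing in for the wine data frame `pf`.
pf <- data.frame(
  quality     = c(5, 5, 6, 6, 6, 7, 7),
  citric.acid = c(0.24, 0.40, 0.27, 0.32, 0.38, 0.28, 0.36)
)

# Six-number summary of citric acid within each quality score.
print(by(pf$citric.acid, pf$quality, summary))

# Group means only, as a named vector indexed by quality score.
group_means <- tapply(pf$citric.acid, pf$quality, mean)
print(group_means)
```

The same two calls work for any other input column, for example pf$chlorides or pf$alcohol.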

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

Residual sugar seems weakly correlated with quality. Across the whole data set, quality tends to increase as residual sugar decreases. However, from quality 4 to 5 and from 7 to 8, residual sugar increases along with quality. To confirm the overall trend, we would need more data on wines rated 4, 8 or 9.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

Quality is relatively strongly correlated with the amount of chlorides: as chlorides decrease, quality increases.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0001 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0004 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0004 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9897  0.9898  0.9903  0.9915  0.9906  0.9970

As density decreases, quality increases. As shown below, density is related to the amount of alcohol, so this tendency might simply reflect the amount of alcohol.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410

Quality is weakly correlated with pH: as pH increases, quality increases.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

The correlation between quality and alcohol is strong, especially when alcohol is above 10.5%.

Next, I look at the relation between sulphates and quality. Because sulphates are an additive, they could presumably lower the quality.

## pf$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2800  0.3800  0.4400  0.4745  0.5425  0.7400 
## -------------------------------------------------------- 
## pf$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4700  0.4761  0.5400  0.8700 
## -------------------------------------------------------- 
## pf$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2700  0.4200  0.4700  0.4822  0.5300  0.8800 
## -------------------------------------------------------- 
## pf$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## pf$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4800  0.5031  0.5800  1.0800 
## -------------------------------------------------------- 
## pf$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4600  0.4862  0.5850  0.9500 
## -------------------------------------------------------- 
## pf$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.360   0.420   0.460   0.466   0.480   0.610

From this plot, sulphates seem to have almost no effect on quality.

Also, from the ggpairs graphs, regarding the input variables:
- Moderate correlations are observed between free.sulfur.dioxide and total.sulfur.dioxide, density and total.sulfur.dioxide, and alcohol and density.

## [1] 0.615501

As free sulfur dioxide increases, total sulfur dioxide increases. (Correlation coefficient = 0.6155)

## [1] 0.5298813

As total sulfur dioxide increases, density increases. (Correlation coefficient = 0.5299) This may suggest that sulfur dioxide compounds are denser than the other compounds in white wine.

## [1] -0.7801376

As alcohol increases, density decreases. (Correlation coefficient = -0.7801)
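
The coefficients above are Pearson correlations, which base R computes with cor(). The sketch below uses only the six rows printed at the top of this report, so its result will not reproduce the full-data value of -0.78. It also squares the three reported coefficients, since r and R^2 are easy to confuse. The vector names are illustrative assumptions.

```r
# First six rows of the data, copied from the table at the top of the report.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
density <- c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)

# Pearson correlation coefficient r (the default method of cor()).
r <- cor(alcohol, density)
print(r)

# Squaring r gives the share of variance explained by a simple linear fit.
reported_r <- c(free_vs_total      = 0.615501,
                total_vs_density   = 0.5298813,
                alcohol_vs_density = -0.7801376)
r_squared <- round(reported_r^2, 3)
print(r_squared)
```

For example, the alcohol and density coefficient of -0.78 corresponds to an R^2 of about 0.61.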

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The correlation between quality and alcohol is strong, especially when alcohol is above 10.5%.
Also, as density decreases, quality increases. As noted above, density is related to the amount of alcohol, so this tendency might simply reflect the amount of alcohol.

Residual sugar seems weakly correlated with quality. Although the order of residual sugar reverses between quality 4 and 5 and between 7 and 8, in general quality increases as residual sugar decreases.

Quality is relatively strongly correlated with the amount of chlorides: as chlorides decrease, quality increases.

Quality is correlated with pH: as pH increases, quality increases.

Citric acid had almost no influence on quality: between quality 3 and 8 the values show no order or trend, even though the amount at quality 9 was higher than at the other levels.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I found a positive correlation between total sulfur dioxide and free sulfur dioxide, with a correlation coefficient of 0.616. According to the information given with this dataset, total sulfur dioxide includes free sulfur dioxide, so this makes sense.
As the amount of total sulfur dioxide increases, density increases.
As the amount of alcohol increases, density decreases. Presumably this is because the density of alcohol (ethanol) is lower than the density of water.
This makes sense because density is a dependent variable determined by the wine's components, such as water, alcohol and sulfur dioxide.

Also, there are far fewer wines rated 3, 4, 8 or 9 than wines rated 5, 6 or 7, so these small samples could produce misleading trends. In particular, between quality 3 and 5, the trend in some input variables (residual sugar, density, alcohol) was the opposite of the trend at the other quality levels.
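
The effect of the small groups can be made concrete with the standard error of a group mean, sd / sqrt(n). The sketch below uses the sample counts reported earlier and assumes, purely for illustration, that every quality group has the same standard deviation, so only the group size matters.

```r
# Samples per quality score, as reported in the count table.
n <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
       `7` = 880, `8` = 175, `9` = 5)

# Standard error of a mean is sd / sqrt(n). With a common sd (an
# assumption), the standard error relative to the largest group
# depends only on the group sizes.
relative_se <- sqrt(max(n) / n)
print(round(relative_se, 1))
```

Under this assumption the mean for quality 9 is about 21 times less precise than the mean for quality 6, and the mean for quality 3 about 10 times less precise.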

What was the strongest relationship you found?

The quality of white wine is most strongly correlated with the amount of alcohol.
The amounts of residual sugar and chlorides, and the pH, are also correlated with quality.

Multivariate Plots Section

It is clear that quality gets higher as alcohol increases, but the trend is less clear for residual sugar.

It is still clear that quality gets higher as the amount of chlorides decreases.

The correlation between pH and quality is hard to see in this plot, even though it is clear in the box plot.

We can clearly see that quality gets higher as density decreases.

There is an outlier around 66 in residual sugar (the maximum is 65.8), so we limit the x-axis to focus on most of the data. (The same adjustment is applied to the residual sugar vs. pH graph.)

Even though the residual sugar data is spread across all quality levels, higher-quality wines tend to have lower residual sugar.

There are some outliers in chlorides, so we limit the x-axis to focus on most of the data.

When pH is high and the amount of chlorides is low, quality is high. Also, when the amount of chlorides is more than 0.10, all pH values are below 3.3.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

From the plot of chlorides vs. alcohol, it is clear that quality increases as the amount of chlorides decreases. It is also clearly visible that wines with more alcohol have better quality.
From the graph of residual sugar vs. pH, we can see, although the effect is small, that wines with less residual sugar have higher quality.

After seeing the graph of pH, I found that pH does not have a strong correlation with quality.

Were there any interesting or surprising interactions between features?

In the boxplot of pH vs. quality, higher pH appeared to go with higher quality, but in the scatter plot it is difficult to see such a clear tendency.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No, I did not create any models.


Final Plots and Summary

Plot One

Description One

The quality scale runs from 0 to 10, but the actual data ranges from 3 to 9. Grade 3 has 20 samples and grade 9 has only 5. The largest group is grade 6 (2198 samples). The small sample sizes could limit the reliability of the observations, because they could be biased.

Plot Two

Description Two

From this graph, it is clear that density and alcohol have a linear relation (corr = -0.78). The boundary between quality levels is clearer along the alcohol axis than along the density axis. For example, between 8% and 10% alcohol the majority of wines have quality 5, between 10% and 12% the majority have quality 6, between 12% and 13% the majority have quality 7, and between 13% and 14% the majority have quality 8. On the density axis, by contrast, it is difficult to find a clear band that separates quality levels.
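
The band-by-band reading above could be checked numerically by binning alcohol with cut() and taking the most frequent quality in each bin. The sketch below uses ten hypothetical rows, not the real data, and extends the last band to 14.2 (the maximum alcohol value in the summary) so that no wine falls outside the bins.

```r
# Hypothetical rows standing in for the wine data frame `pf`.
pf <- data.frame(
  alcohol = c(8.8, 9.5, 9.9, 10.4, 11.0, 11.6, 12.2, 12.8, 13.2, 13.6),
  quality = c(5, 5, 6, 6, 6, 5, 7, 7, 8, 8)
)

# Bin alcohol into the bands discussed in the text.
pf$band <- cut(pf$alcohol, breaks = c(8, 10, 12, 13, 14.2),
               include.lowest = TRUE)

# Cross-tabulate band against quality, then pick the most frequent
# quality score in each band (first one wins on ties).
band_table <- table(pf$band, pf$quality)
majority <- colnames(band_table)[apply(band_table, 1, which.max)]
names(majority) <- rownames(band_table)
print(majority)
```

Running the same steps on the full data frame would show whether each band really has the majority quality described here.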

Plot Three

Description Three

As the chlorides level gets lower, the quality gets higher. Between 0.01 and 0.1 in chlorides, quality ranges from 4 to 9, whereas between 0.1 and 0.3 most quality scores range from 3 to 6.


Reflection

This data set contains information about 4898 white wines across 11 input attributes and 1 output attribute (quality).

We investigated the correlations between quality and the input variables, and among the input variables.

There was a relatively strong relation between quality and alcohol, density and chlorides.
In other words, better quality white wines tend to have a higher alcohol percentage, lower density and lower chlorides.

Also, alcohol and density have a linear relation, presumably because the density of ethanol is lower than that of water.

We have to be aware that this analysis has some limitations:
First, we have limited samples. In particular, the number of samples at quality 3 and 9 is very small. To get more accurate insights from this dataset, we would need to collect more data for these quality levels.
Second, the data set has no other information that could influence quality, such as grape type, wine brand or selling price, due to privacy and logistic issues.